Article Outline
情景描述:
平时工作中经常碰到编码、解码、乱码……类似的问题不胜其烦,如街边小广告一般异常讨厌,需要花时间好好整理一番,“一”绝后患。
<!--more-->
str(s)与unicode(s)
str(s)和unicode(s)是两个工厂方法,分别返回str字符串对象和unicode字符串对象; str(s)是s.encode(‘ascii’)的简写; unicode(s)是s.decode(‘ascii’)的简写;
str
str(object='')
str(object=b'', encoding='utf-8', errors='strict')
object - object whose informal representation is to be returned
encoding - Defaults of UTF-8. Encoding of the given object
errors - response when decoding fails. There are six types of error response:
strict - default response which raises a UnicodeDecodeError exception on failure
ignore - ignores the unencodable unicode from the result
replace - replaces the unencodable unicode to a question mark ?
xmlcharrefreplace - inserts XML character reference instead of unencodable unicode
backslashreplace - inserts a \uNNNN espace sequence instead of unencodable unicode
namereplace - inserts a \N{...} escape sequence instead of unencodable unicode
>>> s3 = u"你好"
>>> s3
u'\u4f60\u597d'
>>> str(s3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
上面s3是unicode类型的字符串,str(s3)相当于是执行s3.encode(‘ascii’)因为“你好”两个汉字不能用ascii码来表示,所以就报错了,指定正确的编码:s3.encode(‘gbk’)或者s3.encode("utf-8")就不会出现这个问题了。
类似的unicode有同样的错误:
>>> s4 = "你好"
>>> unicode(s4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
unicode(s4)等效于s4.decode(‘ascii’),因此要正确的转换就要正确指定其编码s4.decode(‘gbk’)或者s4.decode("utf-8")。
###
In [20]: '中文'
Out[20]: '\xe4\xb8\xad\xe6\x96\x87'
In [21]: u'中文'
Out[21]: u'\u4e2d\u6587'
In [29]: print '中文'
中文
In [30]: print u'中文'
中文
In [34]: print '\u4e2d\u6587'
\u4e2d\u6587
In [35]: print u'\u4e2d\u6587'
中文
In [26]: u'中文'.encode('gb2312')
Out[26]: '\xd6\xd0\xce\xc4'
In [27]: u'中文'.encode('gbk')
Out[27]: '\xd6\xd0\xce\xc4'
In [28]: u'中文'.encode('utf8')
Out[28]: '\xe4\xb8\xad\xe6\x96\x87'
问题:
In [41]: '中文'.encode('utf8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-41-94bb800b6371> in <module>()
----> 1 '中文'.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
解决办法:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')